What is the unemployment rate?

First, we must understand the definition of the unemployment rate to see how it is derived and what factors relate to it. A person in the United States is counted as unemployed if they are jobless, have looked for work in the last four weeks, and are available for work. To count the unemployed, the government surveys a sample of the population and uses it to estimate the total number of unemployed people in a broader area. Measuring unemployment gives us a good overview of the ongoing state of the economy, international competition, technological development, and so on.

The equation of the Unemployment Rate is

\[\text{unemployment rate} = \frac{\text{unemployed}}{\text{labor force}}\times100\] where the labor force includes all people age 16 and older who are classified as either employed or unemployed.
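As a quick sanity check, the formula can be computed directly. The numbers here are hypothetical, not LA data:

```r
# hypothetical example: 500,000 unemployed out of a labor force of 5,000,000
unemployed  <- 500000
labor_force <- 5000000

unemployment_rate <- unemployed / labor_force * 100
unemployment_rate
# [1] 10
```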

In this project, I will focus on analyzing and predicting the unemployment rate in LA County.

Data Cleaning

unprocessed = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/unprocessed_data.csv")
head(unprocessed)
##       DATE unemploy_rate_la avg_price_pipedgas_la avg_price_electr_kwh_La
## 1 1990/1/1              5.9                18.662                   0.105
## 2 1990/2/1              5.6                19.971                   0.108
## 3 1990/3/1              5.4                19.971                   0.108
## 4 1990/4/1              5.5                19.976                   0.108
## 5 1990/5/1              5.4                27.721                   0.107
## 6 1990/6/1              5.4                27.712                   0.108
##   avg_price_gasolone_la civilian_labor_force_la_pch cpi_allitems_la
## 1                 0.957                        #N/A             1.1
## 2                 0.988                         0.2             1.1
## 3                 1.014                         0.1             0.7
## 4                 1.030                        -0.4            -0.2
## 5                 1.080                         0.4             0.3
## 6                 1.103                           0             0.3
##   economics_cond_index_la unemployed_num_pch
## 1                      NA               #N/A
## 2                    1.24               -4.3
## 3                    0.58               -3.8
## 4                   -0.35                  1
## 5                   -0.42               -0.9
## 6                   -0.64               -0.5
##   new_private_housing_structure_issue_la home_price_index_la
## 1                               5121.609                 0.7
## 2                               4648.972                 0.4
## 3                               3628.443                 0.2
## 4                               3833.476                -0.1
## 5                               3466.321                -1.1
## 6                               3496.684                -0.3
##   allemployee_nonfarm_la_pch allemployee_constr_la_pch allemployee_manu_la_pch
## 1                       #N/A                      #N/A                    #N/A
## 2                        0.1                      -1.2                       0
## 3                       -0.1                      -1.1                    -0.4
## 4                       -0.2                      -3.7                    -0.4
## 5                       -0.2                      -1.3                    -0.4
## 6                       -0.3                      -0.9                    -0.7
##   allemployee_finan_la_pch allemployee_leisure_la_pch new_patent_ass_la
## 1                     #N/A                       #N/A               202
## 2                     -0.1                       -0.4               218
## 3                     -0.7                       -0.1               290
## 4                     -0.2                       -0.6               256
## 5                     -0.8                       -0.1               262
## 6                     -0.3                        0.3               242
##   govn_social_insu_pch compen_employee_wage_pch real_disp_inc_per_capital_pch
## 1                  1.6                      0.5                           0.5
## 2                  0.0                      1.2                           0.1
## 3                  1.0                      0.7                          -0.1
## 4                  0.0                      0.9                           0.5
## 5                  0.3                     -0.3                          -0.2
## 6                  1.3                      0.8                           0.0
##   bbk_real_gdp us_interest_rate pers_consum_expen_pch pers_saving_rate
## 1    4.8776608                7                   1.3              8.0
## 2    6.1814509                7                  -0.1              8.6
## 3    2.9195562                7                   0.7              8.3
## 4   -0.5634379                7                   0.4              8.8
## 5    0.7507924                7                   0.2              8.7
## 6    1.1771073                7                   0.8              8.6
##   pers_current_tax_chg govn_social_ben_toperson_pch federal_fund_eff_rate
## 1                 -8.4                          5.3              8.229032
## 2                  8.1                         -0.2              8.237143
## 3                  5.2                          0.7              8.276774
## 4                  4.2                          0.6              8.255000
## 5                  0.6                         -0.5              8.176452
## 6                  3.9                          1.2              8.288667
##   X30_year_fixed_mortgage
## 1                 1.54967
## 2                 3.05710
## 3                 0.69135
## 4                 0.99338
## 5                 1.03664
## 6                -2.99213

Although all variables are supposed to be numeric, some of them are imported as character variables. Also, we need to deal with missing values in the columns whose names contain "pch", which stands for "percent change".

# drop the first row, which contains #N/A placeholders
unprocessed = unprocessed[-1,]
date = unprocessed[,1]

# columns imported as character (because of "#N/A") are converted to numeric
unprocessed = unprocessed[,-1] %>% mutate_if(is.character, as.numeric)

unprocessed$DATE = date
unprocessed = unprocessed %>%
  select(DATE, everything())

# delete variables that have at least 8 missing values 
processed_data = unprocessed %>% select(-avg_price_pipedgas_la, -new_patent_ass_la,
                                     -cpi_allitems_la, -us_interest_rate,
                                     -economics_cond_index_la) %>% 
  head(-2)

# no missing value 
sum(sapply(processed_data, function(x) sum(is.na(x))))
## [1] 0
write.csv(processed_data, "D:\\UCSB\\Spring_2022\\PSTAT 131\\PSTAT_131_HW\\HW2\\PSTAT-131\\Final Project\\data\\processed_data.csv", row.names = FALSE)

EDA

# start with Feb since we delete the first row of unprocessed data
processed = ts(unprocessed[,-1],frequency = 12, start = c(1990,2)) 

autoplot.zoo(processed[,"unemploy_rate_la"])+
  ggtitle("Unemployment Rate in LA County") +
  xlab("Year") +
  ylab("Percentage%")

We can see several well-known recessions between 1990 and the present. The collapse of the dot-com bubble, the financial crisis of 2007, and the Covid-19 pandemic all coincide with sharp increases in the unemployment rate. We may study how different economic indexes fluctuated during the 2007 financial crisis, which was driven mainly by mortgage debt.

Now we want to examine the seasonal pattern of unemployment. Apart from the abnormal fluctuation during Covid-19, we find that the unemployment rate usually peaks in summer. This is a question that requires further research.

ggseasonplot(processed[,"unemploy_rate_la"]) +
  ggtitle("Seasonal Plot of Unemployment Rate in LA County") +
  xlab("Year") +
  ylab("Percentage%")

ggsubseriesplot(processed[,"unemploy_rate_la"]) +
  ylab("Percentage%") +
  ggtitle("Seasonal subseries plot: LA Unemployment Rate")

Spatial Analysis

Net Migration

census_api_key("YOUR_CENSUS_API_KEY") # use your own Census API key
net_migration <- get_estimates(geography = "county", state = "CA",
                               variables = "RNETMIG",
                               year = 2019,
                               geometry = TRUE,
                               resolution = "20m") %>%
  shift_geometry()

order = c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")

net_migration <- net_migration %>%
  mutate(groups = case_when(
    value > 15 ~ "+15 and up",
    value > 5 ~ "+5 to +15",
    value > -5 ~ "-5 to +5",
    value > -15 ~ "-15 to -5",
    TRUE ~ "-15 and below"
  )) %>%
  mutate(groups = factor(groups, levels = order))

state_overlay <- states(
  cb = TRUE,
  resolution = "20m"
) %>%
  filter(GEOID != "72") %>%
  shift_geometry()

ggplot() +
  geom_sf(data = net_migration, aes(fill = groups, color = groups), size = 0.1) + 
  scale_fill_brewer(palette = "PuOr", direction = -1) +
  scale_color_brewer(palette = "PuOr", direction = -1, guide = "none")  +
  labs(title = "Net migration per 1000 residents in CA",
       subtitle = "US Census Bureau 2019 Population Estimates",
       fill = "Rate") +
  theme_minimal(base_family = "Roboto")

We find that LA County has been losing population over the past five years. How this trend will affect the unemployment rate requires further discussion.

Median Age

#median age
med_age <- get_acs(state = "CA", county = "Los Angeles", geography = "tract", 
                  variables = "B01002_001", geometry = TRUE)
med_age %>%
  ggplot(aes(fill = estimate)) + 
  geom_sf(color = NA) + 
  scale_fill_viridis_c(option = "magma")

Building Up Models

Modeltime model

All modeltime algorithms must include a date-time feature.

model_data = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/processed_data.csv")

I first use data from 1990-2016 to test whether these models can forecast with satisfactory accuracy using only previous values of the unemployment rate and the date. Then I want to use models that also consider other economic indicators. Hopefully, I can create models that first perceive the economic situation in the US and then determine the trend of the unemployment rate.

Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again. When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. Therefore, the main concern may be only to predict what will happen, not to know why it happens.

data = model_data %>%
  mutate(DATE = as.Date(DATE))

data = data %>% select(DATE, contains("la"))

data_2016 = data[1:320,] # I don't want to include pandemic 
data_2017 = data[321:332,]

data_2016 %>% plot_time_series(DATE, unemploy_rate_la)

Splitting the data set and creating the training and testing set

splits <- initial_time_split(data_2016, prop = 0.9)

splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(DATE, unemploy_rate_la)
train_2016 = training(splits)
test_2016 = testing(splits)

Correlation graph

train_2016 %>% select(contains("la")) %>%
  cor() %>%
  corrplot(type = "upper", tl.pos = "td",
         method = "circle", tl.cex = 0.5, tl.col = 'black',
         order = "hclust", diag = FALSE)

Here, we compare the unemployment rate only with economic indexes in LA and discover that it is highly correlated with the average price of electricity, the average price of gasoline, and the number of new private housing structures issued in LA. Thus, I will focus on these three predictor variables when creating models.

ACF and PACF

\[S_t=\phi_{21}S_{t-1}+\phi_{22}S_{t-2} + \epsilon_t\] In an AR(2) model like the one above, the coefficient \(\phi_{22}\) is the partial autocorrelation at lag 2. An ACF plot is used to detect lagged features, Fourier series periods, and cyclic patterns. A time series may have a relationship to previous versions of itself; these are called lags, and autocorrelation measures the strength of the relationship between the series and its lags. I want to use the ACF plot to identify lags, which determine the recipe in my machine learning models.
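To build intuition for why the PACF identifies the right lag order, here is a small self-contained check (a toy simulation, not the LA data): for an AR(2) process the sample PACF should be clearly nonzero at lags 1 and 2 and near zero afterward.

```r
# simulate 500 observations from an AR(2) process with phi1 = 0.5, phi2 = 0.3
set.seed(1)
ar2 <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)

# sample partial autocorrelations; the PACF of an AR(p) process
# cuts off after lag p, so lags 3+ should be close to zero
p <- pacf(ar2, plot = FALSE)
round(p$acf[1:4], 2)
```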

# ACF Diagnostics 
data %>% 
  plot_acf_diagnostics(DATE, unemploy_rate_la, .lags = 100)

From the PACF plot, I find that lag 13 is still materially different from zero. However, I may not use such a long lag since I only want to predict the unemployment rate 2 or 3 months ahead. Therefore, lag 6 (half a year) will be included in my lag recipe.

Auto ARIMA

What is ARIMA? What are its parameters?

ARIMA is a relatively simple algorithm that fits a linear model to lagged values. It performs automated differencing and recursive lag forecasting, and we can add Fourier terms to model seasonality. However, ARIMA is very sensitive to the number of lags, and its forecasts can be erratic. Although regularization is not implemented, the parameter search can still take a lot of time.
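For reference, the general ARIMA(p, d, q) model can be written with the backshift operator \(B\) (where \(By_t = y_{t-1}\)) as

\[(1-\phi_1 B-\cdots-\phi_p B^p)(1-B)^d y_t = (1+\theta_1 B+\cdots+\theta_q B^q)\epsilon_t\]

so \(p\) counts the autoregressive lags, \(d\) the number of differences, and \(q\) the moving-average lags. The seasonal models selected below, such as ARIMA(4,0,1)(2,1,1)[12], add a second set of these terms at the seasonal period of 12 months.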

According to the documentation of Auto ARIMA, it selects parameters based on whichever model yields the best in-sample AICc value. During refitting, if a new parameter set yields a lower AICc value, the new model is selected.
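For reference, AICc is the small-sample correction of AIC: for a model with \(k\) estimated parameters fit to \(n\) observations,

\[AICc = AIC + \frac{2k(k+1)}{n-k-1}\]

which penalizes extra parameters more heavily in small samples; as \(n\) grows, AICc approaches AIC.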

# Auto ARIMA
model_fit_auto_arima <- arima_reg() %>%
    set_engine(engine = "auto_arima") %>%
    fit(unemploy_rate_la ~ DATE, 
        data = train_2016)

# w/ XREGS
model_fit_auto_arima_events <- arima_reg() %>%
    set_engine(engine = "auto_arima") %>%
    fit(unemploy_rate_la ~ DATE + avg_price_electr_kwh_La + avg_price_gasolone_la +
        new_private_housing_structure_issue_la, 
        data = train_2016)

# Calibrate
calibration_tbl <- modeltime_table(model_fit_auto_arima,
                                   model_fit_auto_arima_events) %>% 
  modeltime_calibrate(test_2016)

# Forecast test
calibration_tbl %>%
  modeltime_forecast(
    new_data = test_2016,
    actual_data = data_2016
  ) %>%
  plot_modeltime_forecast()
# Accuracy Test 
calibration_tbl %>% modeltime_accuracy()
## # A tibble: 2 x 9
##   .model_id .model_desc                .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 ARIMA(4,0,1)(2,1,1)[12]    Test  0.727 11.9   2.19 11.1  0.851 0.643
## 2         2 REGRESSION WITH ARIMA(1,1~ Test  0.333  5.43  1.00  5.24 0.388 0.961

Clearly, after adding the three predictor variables to our model, we are able to forecast local peaks over a short period. We can then keep forecasting beyond the 2016 data. Checking the real values, we find that the green line does a great job of forecasting the trend and seasonality of the unemployment rate in LA County.

# Refit  
refit_tbl <- calibration_tbl %>%
  modeltime_refit(data_2016)

refit_tbl %>% modeltime_forecast(
  new_data = data_2017,
  actual_data = data_2016
  ) %>%
  plot_modeltime_forecast(
    .conf_interval_alpha = 0.05
  )

Prophet

changepoint_range: limits the potential trend changepoints to a fraction of the data before the end of the time series, controlling the flexibility of the trend component.
changepoint_num: the number of potential changepoints to include when modeling the trend.

#Prophet 
model_fit_prophet <- prophet_reg(
  changepoint_num = 20,
  changepoint_range = 0.9
) %>%
    set_engine(engine = "prophet") %>%
    fit(unemploy_rate_la ~ DATE, data = train_2016)

model_fit_prophet_xregs <- prophet_reg(
  changepoint_num = 20,
  changepoint_range = 0.9,
  seasonality_yearly = 1
) %>%
    set_engine(engine = "prophet") %>%
    fit(unemploy_rate_la ~ DATE + avg_price_electr_kwh_La + avg_price_gasolone_la +
        new_private_housing_structure_issue_la, data = train_2016)

modeltime_table(
  model_fit_prophet,
  model_fit_prophet_xregs
) %>% modeltime_calibrate(new_data = test_2016) %>%
  modeltime_forecast(
    new_data = test_2016,
    actual_data = data_2016
  ) %>%
  plot_modeltime_forecast(.interactive = TRUE)

The forecast graph shows that the prophet model predicts the wrong trend in the testing set. Therefore, we want to visualize the effect of its key parameters to find out what happened. No matter how I adjust the changepoint range and number, the forecast graph does not change.

prophet_model <- model_fit_prophet$fit$models$model_1

prophet_fcst <- predict(prophet_model, 
                        newdata = train_2016)

g <- plot(prophet_model, prophet_fcst) +
  add_changepoints_to_plot(prophet_model) 
  
ggplotly(g)

It seems that the prophet model does not detect the change of trend after 2010 and predicts that the unemployment rate will keep growing. I hypothesize that the rise in the unemployment rate during the mortgage crisis is so influential that the model ignores the slight decrease afterward. Without detecting the decreasing trend at the end of the training set, even the XREGS cannot save the forecast. Although the prophet model might perform well if I extended the training set, this shortcoming makes me abandon it.

I will further test exponential smoothing models.
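A minimal sketch of what that would look like, using base R's HoltWinters() on a toy monthly series (the modeltime route would be exp_smoothing() with the "ets" engine, fit on train_2016; the series below is made up, not the LA data):

```r
# toy monthly series with a mild trend plus annual seasonality
set.seed(123)
toy <- ts(5 + 0.01 * (1:120) + sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 0.05),
          frequency = 12, start = c(1990, 1))

hw <- HoltWinters(toy)           # additive Holt-Winters exponential smoothing
fc <- predict(hw, n.ahead = 12)  # forecast one year ahead
fc
```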

Machine Learning Algorithms

I want to take lags into account when building the models.

data_2016_full <- data_2016 %>%
  bind_rows(
    # add future window
    future_frame(.data = ., .date_var = DATE, .length_out = 1)
  ) %>% 
  # add autocorrelated lags at 1, 4, and 12 months (note: because the calls
  # are chained, later calls also lag the lag columns created earlier,
  # producing compound features such as _lag1_lag4 in the output below)
  tk_augment_lags(unemploy_rate_la, .lags = 1) %>%
  tk_augment_lags(unemploy_rate_la, .lags = 4) %>%
  tk_augment_lags(unemploy_rate_la, .lags = 12) %>%
  tk_augment_lags(contains("la"), .lags = 1) %>%
  tk_augment_lags(contains("la"), .lags = 4) %>%
  tk_augment_lags(contains("la"), .lags = 12)
  

data_2016_full
## # A tibble: 321 x 106
##    DATE       unemploy_rate_la avg_price_electr_kwh_La avg_price_gasolone_la
##    <date>                <dbl>                   <dbl>                 <dbl>
##  1 1990-02-01              5.6                   0.108                 0.988
##  2 1990-03-01              5.4                   0.108                 1.01 
##  3 1990-04-01              5.5                   0.108                 1.03 
##  4 1990-05-01              5.4                   0.107                 1.08 
##  5 1990-06-01              5.4                   0.108                 1.10 
##  6 1990-07-01              6.2                   0.108                 1.13 
##  7 1990-08-01              6.2                   0.108                 1.28 
##  8 1990-09-01              6.3                   0.108                 1.35 
##  9 1990-10-01              6.2                   0.109                 1.42 
## 10 1990-11-01              6.5                   0.11                  1.42 
## # ... with 311 more rows, and 102 more variables:
## #   civilian_labor_force_la_pch <dbl>,
## #   new_private_housing_structure_issue_la <dbl>, home_price_index_la <dbl>,
## #   allemployee_nonfarm_la_pch <dbl>, allemployee_constr_la_pch <dbl>,
## #   allemployee_manu_la_pch <dbl>, allemployee_finan_la_pch <dbl>,
## #   allemployee_leisure_la_pch <dbl>, unemploy_rate_la_lag1 <dbl>,
## #   unemploy_rate_la_lag4 <dbl>, unemploy_rate_la_lag12 <dbl>, ...
new_splits <- initial_time_split(data_2016_full, prop = 0.9)
new_train_2016 <- training(new_splits)
new_test_2016 <- testing(new_splits)

Elastic Net Algorithm

Elastic Net regression is very good at capturing trends, but it may not capture complex patterns, since it applies regularization to what is still a linear regression.

We know a linear regression tends to over-fit when we add many predictors, which is exactly the situation here with the lag features.
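Concretely, in glmnet's parameterization the elastic net minimizes

\[\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{T}\beta\right)^2 + \lambda\left(\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right)\]

where \(\lambda\) corresponds to the penalty argument and \(\alpha\) to the mixture argument in the model specification below: \(\alpha = 1\) gives the lasso and \(\alpha = 0\) gives ridge regression.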

recipe_spec_base <- recipe(unemploy_rate_la ~., data = new_train_2016) %>%
  step_timeseries_signature(DATE) %>%
  # feature removal
  step_rm(matches("(iso)|(xts)|(hour)|(minute)|(second)|(am.pm)|(day)|(week)")) %>%
  # standardization
  step_normalize(matches("(index.num)|(year)|(issue)")) %>%
  # month feature is converted to dummy variables 
  step_dummy(all_nominal(), one_hot = TRUE) 
  # may later add interaction and fourier series features

recipe_spec = recipe_spec_base %>%
  step_naomit(matches("lag"))

# spline
recipe_spec_1 <- recipe_spec_base %>% 
  step_rm(DATE) %>%
  step_ns(ends_with("index.num")) %>%
  step_rm(matches("lag"))

recipe_spec %>% prep() %>% juice() %>% glimpse()
## Rows: 259
## Columns: 123
## $ DATE                                                   <date> 1992-07-01, 19~
## $ avg_price_electr_kwh_La                                <dbl> 0.120, 0.120, 0~
## $ avg_price_gasolone_la                                  <dbl> 1.393, 1.385, 1~
## $ civilian_labor_force_la_pch                            <dbl> 0.7, -0.2, -0.9~
## $ new_private_housing_structure_issue_la                 <dbl> -0.2055146, -0.~
## $ home_price_index_la                                    <dbl> -0.9, -0.9, -0.~
## $ allemployee_nonfarm_la_pch                             <dbl> -0.2, -0.2, -0.~
## $ allemployee_constr_la_pch                              <dbl> -1.4, -0.9, -0.~
## $ allemployee_manu_la_pch                                <dbl> -0.4, -0.7, -0.~
## $ allemployee_finan_la_pch                               <dbl> -1.1, -1.0, 0.0~
## $ allemployee_leisure_la_pch                             <dbl> 2.0, -0.2, -0.1~
## $ unemploy_rate_la_lag1                                  <dbl> 10.2, 10.6, 10.~
## $ unemploy_rate_la_lag4                                  <dbl> 9.3, 8.9, 9.5, ~
## $ unemploy_rate_la_lag12                                 <dbl> 8.6, 8.4, 8.4, ~
## $ avg_price_electr_kwh_La_lag1                           <dbl> 0.120, 0.120, 0~
## $ avg_price_gasolone_la_lag1                             <dbl> 1.374, 1.393, 1~
## $ civilian_labor_force_la_pch_lag1                       <dbl> 0.9, 0.7, -0.2,~
## $ new_private_housing_structure_issue_la_lag1            <dbl> -0.3499802, -0.~
## $ home_price_index_la_lag1                               <dbl> -1.1, -0.9, -0.~
## $ allemployee_nonfarm_la_pch_lag1                        <dbl> -0.4, -0.2, -0.~
## $ allemployee_constr_la_pch_lag1                         <dbl> -1.5, -1.4, -0.~
## $ allemployee_manu_la_pch_lag1                           <dbl> -0.5, -0.4, -0.~
## $ allemployee_finan_la_pch_lag1                          <dbl> -0.1, -1.1, -1.~
## $ allemployee_leisure_la_pch_lag1                        <dbl> -0.4, 2.0, -0.2~
## $ unemploy_rate_la_lag1_lag1                             <dbl> 9.5, 10.2, 10.6~
## $ unemploy_rate_la_lag4_lag1                             <dbl> 9.5, 9.3, 8.9, ~
## $ unemploy_rate_la_lag12_lag1                            <dbl> 8.2, 8.6, 8.4, ~
## $ avg_price_electr_kwh_La_lag4                           <dbl> 0.120, 0.119, 0~
## $ avg_price_gasolone_la_lag4                             <dbl> 1.215, 1.242, 1~
## $ civilian_labor_force_la_pch_lag4                       <dbl> -0.4, 0.1, -0.1~
## $ new_private_housing_structure_issue_la_lag4            <dbl> -0.4126281, -0.~
## $ home_price_index_la_lag4                               <dbl> -0.7, -0.7, -0.~
## $ allemployee_nonfarm_la_pch_lag4                        <dbl> -0.1, -0.2, -0.~
## $ allemployee_constr_la_pch_lag4                         <dbl> 1.3, -0.5, -0.8~
## $ allemployee_manu_la_pch_lag4                           <dbl> -0.4, -0.6, -0.~
## $ allemployee_finan_la_pch_lag4                          <dbl> 0.3, -0.4, 0.5,~
## $ allemployee_leisure_la_pch_lag4                        <dbl> -0.4, 0.5, -0.7~
## $ unemploy_rate_la_lag1_lag4                             <dbl> 9.5, 9.3, 8.9, ~
## $ unemploy_rate_la_lag4_lag4                             <dbl> 8.3, 8.4, 9.4, ~
## $ unemploy_rate_la_lag12_lag4                            <dbl> 7.6, 7.4, 7.9, ~
## $ avg_price_electr_kwh_La_lag1_lag4                      <dbl> 0.121, 0.120, 0~
## $ avg_price_gasolone_la_lag1_lag4                        <dbl> 1.199, 1.215, 1~
## $ civilian_labor_force_la_pch_lag1_lag4                  <dbl> 0.2, -0.4, 0.1,~
## $ new_private_housing_structure_issue_la_lag1_lag4       <dbl> -0.008603505, -~
## $ home_price_index_la_lag1_lag4                          <dbl> -0.4, -0.7, -0.~
## $ allemployee_nonfarm_la_pch_lag1_lag4                   <dbl> -0.2, -0.1, -0.~
## $ allemployee_constr_la_pch_lag1_lag4                    <dbl> -3.9, 1.3, -0.5~
## $ allemployee_manu_la_pch_lag1_lag4                      <dbl> -0.2, -0.4, -0.~
## $ allemployee_finan_la_pch_lag1_lag4                     <dbl> -1.0, 0.3, -0.4~
## $ allemployee_leisure_la_pch_lag1_lag4                   <dbl> 0.0, -0.4, 0.5,~
## $ unemploy_rate_la_lag1_lag1_lag4                        <dbl> 9.4, 9.5, 9.3, ~
## $ unemploy_rate_la_lag4_lag1_lag4                        <dbl> 8.2, 8.3, 8.4, ~
## $ unemploy_rate_la_lag12_lag1_lag4                       <dbl> 7.6, 7.6, 7.4, ~
## $ avg_price_electr_kwh_La_lag12                          <dbl> 0.117, 0.117, 0~
## $ avg_price_gasolone_la_lag12                            <dbl> 1.126, 1.143, 1~
## $ civilian_labor_force_la_pch_lag12                      <dbl> 0.9, -0.4, -0.1~
## $ new_private_housing_structure_issue_la_lag12           <dbl> -0.015576211, 0~
## $ home_price_index_la_lag12                              <dbl> 0.2, -0.1, -0.4~
## $ allemployee_nonfarm_la_pch_lag12                       <dbl> -0.2, -0.3, -0.~
## $ allemployee_constr_la_pch_lag12                        <dbl> -0.1, -0.5, -1.~
## $ allemployee_manu_la_pch_lag12                          <dbl> -0.7, -0.4, -0.~
## $ allemployee_finan_la_pch_lag12                         <dbl> 0.6, -0.2, -0.6~
## $ allemployee_leisure_la_pch_lag12                       <dbl> -0.1, 0.0, -0.2~
## $ unemploy_rate_la_lag1_lag12                            <dbl> 8.2, 8.6, 8.4, ~
## $ unemploy_rate_la_lag4_lag12                            <dbl> 7.6, 7.4, 7.9, ~
## $ unemploy_rate_la_lag12_lag12                           <dbl> 6.2, 6.2, 6.3, ~
## $ avg_price_electr_kwh_La_lag1_lag12                     <dbl> 0.117, 0.117, 0~
## $ avg_price_gasolone_la_lag1_lag12                       <dbl> 1.173, 1.126, 1~
## $ civilian_labor_force_la_pch_lag1_lag12                 <dbl> 1.0, 0.9, -0.4,~
## $ new_private_housing_structure_issue_la_lag1_lag12      <dbl> 0.145211149, -0~
## $ home_price_index_la_lag1_lag12                         <dbl> 0.6, 0.2, -0.1,~
## $ allemployee_nonfarm_la_pch_lag1_lag12                  <dbl> -0.3, -0.2, -0.~
## $ allemployee_constr_la_pch_lag1_lag12                   <dbl> -1.1, -0.1, -0.~
## $ allemployee_manu_la_pch_lag1_lag12                     <dbl> -0.5, -0.7, -0.~
## $ allemployee_finan_la_pch_lag1_lag12                    <dbl> -0.3, 0.6, -0.2~
## $ allemployee_leisure_la_pch_lag1_lag12                  <dbl> -0.2, -0.1, 0.0~
## $ unemploy_rate_la_lag1_lag1_lag12                       <dbl> 7.9, 8.2, 8.6, ~
## $ unemploy_rate_la_lag4_lag1_lag12                       <dbl> 7.6, 7.6, 7.4, ~
## $ unemploy_rate_la_lag12_lag1_lag12                      <dbl> 5.4, 6.2, 6.2, ~
## $ avg_price_electr_kwh_La_lag4_lag12                     <dbl> 0.118, 0.118, 0~
## $ avg_price_gasolone_la_lag4_lag12                       <dbl> 1.012, 1.066, 1~
## $ civilian_labor_force_la_pch_lag4_lag12                 <dbl> -0.1, -0.3, -0.~
## $ new_private_housing_structure_issue_la_lag4_lag12      <dbl> 0.263781427, -0~
## $ home_price_index_la_lag4_lag12                         <dbl> -1.3, 0.2, 0.2,~
## $ allemployee_nonfarm_la_pch_lag4_lag12                  <dbl> -0.4, 0.0, -0.4~
## $ allemployee_constr_la_pch_lag4_lag12                   <dbl> -2.6, -0.4, 0.0~
## $ allemployee_manu_la_pch_lag4_lag12                     <dbl> -0.8, -0.2, -0.~
## $ allemployee_finan_la_pch_lag4_lag12                    <dbl> 0.0, -0.8, -0.2~
## $ allemployee_leisure_la_pch_lag4_lag12                  <dbl> -0.2, 1.4, -0.2~
## $ unemploy_rate_la_lag1_lag4_lag12                       <dbl> 7.6, 7.6, 7.4, ~
## $ unemploy_rate_la_lag4_lag4_lag12                       <dbl> 6.5, 6.5, 7.3, ~
## $ unemploy_rate_la_lag12_lag4_lag12                      <dbl> 5.4, 5.5, 5.4, ~
## $ avg_price_electr_kwh_La_lag1_lag4_lag12                <dbl> 0.118, 0.118, 0~
## $ avg_price_gasolone_la_lag1_lag4_lag12                  <dbl> 1.108, 1.012, 1~
## $ civilian_labor_force_la_pch_lag1_lag4_lag12            <dbl> 0.9, -0.1, -0.3~
## $ new_private_housing_structure_issue_la_lag1_lag4_lag12 <dbl> 0.050017742, 0.~
## $ home_price_index_la_lag1_lag4_lag12                    <dbl> -1.1, -1.3, 0.2~
## $ allemployee_nonfarm_la_pch_lag1_lag4_lag12             <dbl> -0.5, -0.4, 0.0~
## $ allemployee_constr_la_pch_lag1_lag4_lag12              <dbl> -1.7, -2.6, -0.~
## $ allemployee_manu_la_pch_lag1_lag4_lag12                <dbl> -0.8, -0.8, -0.~
## $ allemployee_finan_la_pch_lag1_lag4_lag12               <dbl> -0.2, 0.0, -0.8~
## $ allemployee_leisure_la_pch_lag1_lag4_lag12             <dbl> -0.6, -0.2, 1.4~
## $ unemploy_rate_la_lag1_lag1_lag4_lag12                  <dbl> 7.3, 7.6, 7.6, ~
## $ unemploy_rate_la_lag4_lag1_lag4_lag12                  <dbl> 6.2, 6.5, 6.5, ~
## $ unemploy_rate_la_lag12_lag1_lag4_lag12                 <dbl> 5.6, 5.4, 5.5, ~
## $ unemploy_rate_la                                       <dbl> 10.6, 10.4, 10.~
## $ DATE_index.num                                         <dbl> -1.374937, -1.3~
## $ DATE_year                                              <dbl> -1.380931, -1.3~
## $ DATE_half                                              <int> 2, 2, 2, 2, 2, ~
## $ DATE_quarter                                           <int> 3, 3, 3, 4, 4, ~
## $ DATE_month                                             <int> 7, 8, 9, 10, 11~
## $ DATE_month.lbl_01                                      <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_02                                      <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_03                                      <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_04                                      <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_05                                      <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_06                                      <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_07                                      <dbl> 1, 0, 0, 0, 0, ~
## $ DATE_month.lbl_08                                      <dbl> 0, 1, 0, 0, 0, ~
## $ DATE_month.lbl_09                                      <dbl> 0, 0, 1, 0, 0, ~
## $ DATE_month.lbl_10                                      <dbl> 0, 0, 0, 1, 0, ~
## $ DATE_month.lbl_11                                      <dbl> 0, 0, 0, 0, 1, ~
## $ DATE_month.lbl_12                                      <dbl> 0, 0, 0, 0, 0, ~
recipe_spec_2 <- recipe_spec_base %>%
  step_rm(DATE) %>%
  step_naomit(matches("lag"))
model_spec_glmet <- linear_reg(
  mode = "regression",
  penalty = 0.01,
  mixture = 0.5
) %>%
  set_engine("glmnet") 

# spline 
wflw_fit_glmnet_spline <- workflow() %>%
  add_model(model_spec_glmet) %>%
  add_recipe(recipe_spec_1) %>%
  fit(new_train_2016)

# lag
wflw_fit_glmnet_lag <- workflow() %>%
  add_model(model_spec_glmet) %>%
  add_recipe(recipe_spec_2) %>%
  fit(new_train_2016)
calibration_tbl_2 <- modeltime_table(
  wflw_fit_glmnet_spline,
  wflw_fit_glmnet_lag
) %>%
  update_model_description(1, "GLMNET - Spline") %>%
  update_model_description(2, "GLMNET - Lag") %>%
  modeltime_calibrate(new_test_2016)

calibration_tbl_2 %>% modeltime_accuracy()
## # A tibble: 2 x 9
##   .model_id .model_desc     .type   mae  mape   mase smape  rmse   rsq
##       <int> <chr>           <chr> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1         1 GLMNET - Spline Test  5.36  86.5  16.1   55.9  5.89  0.264
## 2         2 GLMNET - Lag    Test  0.143  2.21  0.431  2.19 0.173 0.984
calibration_tbl_2 %>%
  modeltime_forecast(
    new_data = new_test_2016,
    actual_data = data_2016_full
  ) %>%
  plot_modeltime_forecast(.conf_interval_show = FALSE)
library(earth)
# Multivariate Adaptive Regression Spline model 
model_spec_mars <- mars(mode = "regression") %>%
    set_engine("earth") 

#recipe_spec <- recipe(unemploy_rate_la ~ DATE, data = new_train_2016) %>%
    #step_date(DATE, features = "month", ordinal = FALSE) %>%
    #step_mutate(date_num = as.numeric(DATE)) %>%
    #step_normalize(date_num) %>%
    #step_rm(DATE)

wflw_fit_mars_spline <- workflow() %>%
  add_model(model_spec_mars) %>%
  add_recipe(recipe_spec_1) %>%
  fit(new_train_2016)

wflw_fit_mars_lag <- workflow() %>%
  add_model(model_spec_mars) %>%
  add_recipe(recipe_spec_2) %>%
  fit(new_train_2016)
  
wflw_fit_mars_simple <- workflow() %>%
    add_recipe(recipe_spec) %>%
    add_model(model_spec_mars) %>%
    fit(new_train_2016)

calibration_tbl_3 <- modeltime_table(
  wflw_fit_mars_spline,
  wflw_fit_mars_lag,
  wflw_fit_mars_simple
)

calibration_tbl_3 %>%
  modeltime_forecast(
    new_data = new_test_2016,
    actual_data = data_2016_full
  ) %>%
  plot_modeltime_forecast(.conf_interval_show = FALSE)

Add fitted models to a Model Table

models_tbl <- modeltime_table(
    #model_fit_arima_no_boost,
    model_fit_arima_boosted,
    model_fit_ets,
    model_fit_prophet,
    model_fit_lm,
    wflw_fit_mars
)

models_tbl
## # Modeltime Table
## # A tibble: 5 x 3
##   .model_id .model     .model_desc                              
##       <int> <list>     <chr>                                    
## 1         1 <fit[+]>   ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS
## 2         2 <fit[+]>   ETS(A,AD,A)                              
## 3         3 <fit[+]>   PROPHET                                  
## 4         4 <fit[+]>   LM                                       
## 5         5 <workflow> EARTH

Calibrate the model to a testing set

calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testing(splits))
calibration_tbl
## # Modeltime Table
## # A tibble: 5 x 5
##   .model_id .model     .model_desc                        .type .calibration_da~
##       <int> <list>     <chr>                              <chr> <list>          
## 1         1 <fit[+]>   ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOS~ Test  <tibble>        
## 2         2 <fit[+]>   ETS(A,AD,A)                        Test  <tibble>        
## 3         3 <fit[+]>   PROPHET                            Test  <tibble>        
## 4         4 <fit[+]>   LM                                 Test  <tibble>        
## 5         5 <workflow> EARTH                              Test  <tibble>
calibration_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = data_2016
    ) %>%
    plot_modeltime_forecast(
      .legend_max_width = 25
    )
calibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = FALSE
    )
Accuracy Table

.model_id  .model_desc                                 .type   mae    mape    mase   smape   rmse  rsq
1          ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS   Test    0.72   12.68   2.18   11.22   1.01  0.67
2          ETS(A,AD,A)                                 Test    0.60   10.02   1.82    9.35   0.74  0.85
3          PROPHET                                     Test    6.37  100.78  19.17   64.20   6.60  0.52
4          LM                                          Test    2.35   38.82   7.07   30.58   2.67  0.00
5          EARTH                                       Test    0.64   10.57   1.94    9.80   0.75  0.94

It seems that I still need to

  1. tune the hyper-parameters,
  2. use cross-validation, and
  3. analyze the ARIMA models further:
     1. differencing and white noise
     2. acf() and pacf()
     3. the KPSS unit root test
     (and explain the intuition behind these functions)

Random Forest

Neural Network